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This paper considers the maximum likelihood estimation of factor 
models of high dimension, where the number of variables (N) is com- 
parable with or even greater than the number of observations (T). An 
inferential theory is developed. We establish not only consistency but 
also the rate of convergence and the limiting distributions. Five differ- 
ent sets of identification conditions are considered. We show that the 
distributions of the MLE estimators depend on the identification re- 
strictions. Unlike the principal components approach, the maximum 
likelihood estimator explicitly allows heteroskedasticities, which are 
jointly estimated with other parameters. Efficiency of MLE relative 
to the principal components method is also considered. 

1. Introduction. Factor models provide an effective way of summarizing 
information from large data sets, and are widely used in social and physical 
sciences. 3 There has also been advancement in the theoretical analysis of 
factor models of high dimension. Much of this progress has been focused 
on the principal components method; see, for example, [5, 7, 22] and [23]. 4 
The advantage of the principal components method is that it is easy to com- 
pute and it provides consistent estimators for the factors and factor loadings 
when both iV and T are large. The principal components method implicitly 
assumes that the idiosyncratic covariance matrix is a scalar multiple of an 
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identity matrix. While the method is robust to heteroscedasticity and weak 
correlations in the idiosyncratic errors, there are biases associated with the 
estimates. In fact, if N is fixed, the principal components estimator for the 
factor loadings is inconsistent, as shown in [5], except under homoscedastic- 
ity 

In this paper, we consider the maximum likelihood estimator under the 
setting of large N and large T. The maximum likelihood estimator (MLE) 
is more efficient than the principal components method. In addition, MLE is 
also consistent and efficient under fixed N and large T because this setting 
falls within the framework of classical inference; see, for example, [3] and [18]. 

Our estimator coincides with the classical factor analysis. However, the 
statistical theory does not follow from the existing literature. Classical in- 
ferential theory is based on the assumption that N is fixed (or one of the 
dimensions is fixed). This assumption, to a certain extent, runs counter to 
the primary purpose of factor analysis, which is to explain the commonality 
among a large number of variables in terms of a small number of latent 
factors. Let M zz = Ylt=i( z t — z)( z t — be the data matrix of N x N 
and let T, zz (6) = E(M ZZ ). A key assumption in classical inference is that 
v / Tvech(M 2Z — £ 22 (#)) is asymptotically normal with a positive definite 
limiting covariance matrix, as T — > oo. This assumption does not hold as N 
also goes to infinity. The asymptotic normality is not well defined with an 
increasing dimension. For example, if N > T, M zz is a singular matrix, so it 
cannot have a normal distribution with a positive covariance matrix. Fur- 
thermore, the dimension of the unknown parameters (denoted by 9) is also 
increasing as N increases. The usual delta method (Taylor expansion) for 
deriving the limiting distribution of the MLE of 9 will not work. Therefore, 
the high-dimensional inference for MLE requires a new framework. 

Fixing ./V is for the purpose of tractability for theoretical analysis. Such 
an assumption is unduly restrictive. Many applications or theoretical models 
involve data sets with the number of variables comparable with or even 
greater than the number of observations; see [11, 20, 22] and [23]. Although 
the large- N analysis is demanding, the limiting distribution of the maximum 
likelihood estimator has a much simpler form under large N than under 
fixed N. 

There exists a small literature on efficient estimation of factors and factor 
loadings under large N. [9] considers a two-step approach by treating both 
the factors and the factor loadings as the parameters of interest. [12] also 
considers a two-step approach. The first step uses the principal components 
method to obtain the residuals and the second step uses a feasible gener- 
alized least squares. This method depends on large N and large T to get 
consistent estimation of the residual variances. MLE is considered by [14]. 
A certain average consistency is obtained; [14] does not consider consistency 
for individual parameters nor the limiting distributions. 
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The present paper is the first to develop a full statistical theory for the 
maximum likelihood estimator. Our approach is different from the existing 
literature. The challenge of the analysis lies in the simultaneous estima- 
tion of the heteroscedasticities and other model parameters. To estimate 
the heteroscedasticity, the maximum likelihood estimator does not rely on 
estimating the individual residuals, which would be the case for two-step 
procedures. Using residuals to construct variance estimators will be incon- 
sistent when one of the dimension is fixed. The MLE remains consistent 
under fixed N. 

The rest of this paper is organized as follows. Section 2 introduces the 
model and assumptions. Section 3 considers a symmetrical presentation for 
factor models. Identification conditions are considered in Section 4. Consis- 
tency and limiting distributions are derived in Section 5. Section 6 considers 
the estimation of factor scores. Section 7 compares the efficiency of the MLE 
relative to the principal components method and Section 8 discusses com- 
putational issues. The last section concludes. Proofs of consistency are given 
in the Appendix and additional proofs are provided in the supplement [6]. 
Throughout the paper, the norm of a vector or matrix is that of Frobe- 
nius, that is, ||^4|| = [tr(^4'^4)] 1//2 for vector or matrix A; diag(^4) represents 
a diagonal matrix when A is a vector, but diag(^4) can be either a matrix 
or a column vector (consisting of the diagonal elements of A) when A is 
a matrix. 

2. Factor models. Let N denote the number of variables and T the sam- 
ple size. For i = 1, . . . , N and t = 1, . . . , T, the observation zu is said to have 
a factor structure if it can be represented as 

(2.1) zn = a i + X' i f t + e it , 

where f t = (fu,ft2, • • • , ftr)' and A; = (Xa, X ir )'; both are r x 1. Let A = 
(Ai, A2, • • • , Ajy)' be JV x r, and z t = (z±t, ■ ■ ■ , z^t)' be the N x 1 vector of 
observable variables. Let et and a be similarly defined. In matrix form, 

(2.2) z t = a + Af t + e t . 

The vector z% is observable; none of the right-hand side variables are observ- 
able. We make the following assumptions: 

Assumption A. {f t } is a sequence of fixed constants. Let Mff = y x 
Ht=i(ft - f)(ft - f Y be the sample variance of f t where / = ^J2t=ift- 
There exists an Mff > (positive definite) such that Mff = lim.T~^ooMff. 

Assumption B. E(e t ) = 0; E(e t e' t ) = S ee = diag(o-f ,of, . . . ,cr%); 
E{ef t ) < C 4 for all i and t, for some C < 00. The en are independent for 
all i and t, and the JV x 1 vector et is identically distributed over t. 
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Assumption C. There exists a positive constant C large enough such 
that: 

(C.l) IIAjH <C for all j. 
(C.2) C~ 2 <a]<C 2 for all j. 

(C.3) The limits lim^oo iV" 1 A'S" 1 A = Q and lim^oo ^ E^i °^(^ ® 
Aj)(A- ® A^) = Q exist, where Q and O are positive definite matrices. 

Assumption D. The variances crj are estimated in the compact set 
[C~ 2 , C ]. Furthermore, Mft is restricted to be in a set consisting of all 
semi-positive definite matrices with all elements bounded in the interval 
[— C, C], where C is a large constant. 

In Assumption A, we assume ft is a sequence of fixed constants. Our 
analysis holds if it is a sequence of random variables. In this case, we assume 
ft to be independent of all other variables. The analysis can then be regarded 
as conditioning on {ft}- Without loss of generality, we assume that / = 
7^ J2t=i ft = [or E{ft) = for random factors] because the model can be 
rewritten as z t = a + A/ + A(f t - f) + e t = a* + A/* + e t with a* = a + A/ 
and ft = ft~f- In Assumption B, we assume et to be independent over 
time. In fact, our consistent result still holds if et are serially correlated and 
heteroscedastic over time or eu are correlated over i, provided that these 
correlations are weak (sufficient conditions are given in [2]). The limiting 
distribution then would need modification. For simplicity, we shall consider 
the uncorrelated case. The analysis of the maximum likelihood estimation 
under high dimension is already difficult; allowing correlation will make the 
analysis even more cumbersome. We will report the results under general 
correlation patterns in a separate paper. Assumption D is for theoretical 
analysis. Like all nonlinear (nonconvex) analysis, parameters are assumed 
to be in a bounded set. 

The second moment of the sample, denoted by M zz , is 

1 T 

(2-3) M zz = -Y J {zt-z){z t -z)\ 

t=i 

where z = T~ l Ylt=i z t- Note that the division by T instead of T — 1 is for 
notational simplicity. Let Y> zz be 

(2.4) S Z2 = AM // A' + S ee . 

The objective function considered in this paper is 

(2-5) lnL = -^ln|S zz | - ^tr(M„^). 

The above objective function may be regarded as a quasi likelihood function. 
To see this, assume ft is stochastic with mean zero and variance Sjj. From 
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zt = a + Aft + et, the variance matrix of z t , denoted by Ti zz , is 

S zz = ASjjA' + S ee . 
So the quasi likelihood function (omitting a constant) can be written as 

1 1 T 

t=i 



1 1 



2Ar , «, 2Arr 



2iVT 

t 



a) 



Clearly a minimizes the likelihood function at z. So the concentrated quasi 
likelihood function can now be written as 



In L = lnl S-2 1 tr 

2N 1 1 2N 



1 T 

j;^^ - z)(z t - z)'T,J 
t=i 



= -^ln|S zz |-^tr(M Z2 S- 1 ), 

which is the same as (2.5) except that Y,ff is in place of Mfj. Because 
the factors are fixed constants instead of random variables, as stated in 
Assumption A, it is natural to use Mff rather than Tiff in (2.4) and (2.5). 

If both A and F = (/i, /2, • • • , fr)' are treated as parameters, the corre- 
sponding likelihood function is 

T 

(2.6) - — ln|E ee | - — - - a - A/O'E" 1 ^ -a- A/ t ). 

t=i 

Since A has Nr parameters and F has Tr parameters, the number of pa- 
rameters to be estimated will be very large, which leads to efficiency loss. 
In contrast, the number of parameters in (2.5) is only N(r + 1) + r(r + l)/2, 
which is considerably smaller than the number of parameters in (2.6), which 
is N(r + 1) + Tr. The difference is pronounced for small N but large T. In 
fact, when estimating A, F and T ee , the global maximum likelihood estima- 
tor does not exist. It can be shown that the likelihood function diverges to 
infinity by certain choice of parameters (see [2] , page 587) . 

By restricting T ee = ij\r (an identity matrix), the MLE estimator of (2.6) 
becomes the principal components estimator. That is, the principal com- 
ponents method minimizes the objective function Ylt=i( z t ~ a ~ ^~ft)'( z t ~ 
a — Aft) over a, A and F. The estimators cannot be efficient when het- 
eroscedasticity actually exists. 
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Even though the ft are fixed constants, we avoid directly estimating ft- 
Instead we only estimate the sample moment of ft- This considerably re- 
duces the number of parameters and removes the corresponding incidental 
parameters bias. The estimator is also consistent under fixed N, since the 
setting falls back to the classical factor analysis. 

By maximizing (2.5), in combination with (2.4), we can obtain three first- 
order conditions (see, e.g., [18]): 

(2.7) A'±J z 1 (M zz -± zz ) = 0, 

(2.8) diag(Sj z 1 ) = diag(S- 1 M 22 S" 1 ), 

(2.9) I't-fA = k'±-}M zz ±-jk, 

where A, Mfj and £ ee denote the MLE and £ 22 = KMjf K' + E ee . 

Condition (2.7) is derived from the partial derivatives with respect to A, 
(2.8) is derived with respect to the diagonal elements of S ee , and (2.9) is 
derived with respect to Mff. Equation (2.9) can be obtained from (2.7) 
by post-multiplying T,~ Z A. Since (2.9) is redundant, in order to make the 
system of three equations solvable, we need to impose further restrictions. 
These identification restrictions will be discussed in Section 4. 

3. Symmetry and choice of representations. Consider the model 

zu = 5 t + X'ift + eu- 

Let Zi = (zn,z i2 , . . . ,z iT )', S = (Si,..., St)', F = (fi,f 2 , ■ ■ ■ ,/t)' and e; = 
(eii,...,e iT y; then 

Zi = 5 + FXi + a 

(i = l,2,...,N). Define 
1 - 

M zz = - J2( Zi - z)( Zi - z)\ S 22 = FM XX F' + Et e) 

i=l 

where Sj e = diag(cr^, . . . , a\) and 

1 - 

1=1 

is the r x r sample variance of the factor loadings. 

Although we use the same notation of M 22 and E 22 , they are now T x T 
matrices instead oi N x N . The matrix sj e contains idiosyncratic variances 
in the time dimension (time series heteroscedasticity) . The quasi maximum 
likelihood estimator maximizes the likelihood function 

lnL = --Ln|£ 2 ,| - trCM^E" 1 ). 
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This representation avoids estimating Ai, A2, • • • , Ajy directly, but only the 
sample moment of Aj. The representation has Tr + T + r(r + l)/2 num- 
ber of parameters. If N is much larger than T, this representation will give 
more efficient estimation. In particular, if T is fixed, we can only use this 
representation to get consistent estimation of /i,/2,---,/t an d sL- This 
representation will also be useful if one is interested in estimating the het- 
eroscedasticity in the time dimension. 

The analysis of one representation will carry over to the other by switch- 
ing the role of N and T and the role of A and F. So it is sufficient to carefully 
examine one representation. Bearing this in mind, our analysis focuses on 
the representation in the previous section. The objective function in (2.5) 
involves fewer parameters when N is less than T, although we make no as- 
sumption about the relative size between N and T (except for Theorem 6.1), 
and in particular, N is allowed to be much larger than T. 

4. Identification conditions. It is well known that the factor models are 
not identifiable without additional restrictions. For any r x r invertible ma- 
trix R, we have AMffA' = A%A' where A = AR and M ff = R^MffR'" 1 . 
Thus observationally equivalent models are obtained. In order to uniquely 
fix A and Mff given AMffA', we need r 2 restrictions since an invertible r x r 
matrix has r free parameters. For details of identification conditions, read- 
ers are referred to [18] and [4]. There are many ways to impose restrictions. 
In this paper, we consider five identification strategies which have been used 
in traditional factor analysis. These restrictions are listed in Table 1. The 
left pane is for the representation in Section 2, while the right pane is for 
the representation in Section 3. 

We make some comments on these restrictions. Given AMffA', IC1 will 
uniquely fix A and Mff. So full identification is achieved. But this is not 
the case for IC2. If we change the sign of any column of A, AMffA' is not 
changed. This implies that we only identify A up to a column sign change. 

Furthermore, if we switch the positions between the ith and jth columns 
of A, and the positions between the ith and jth diagonal elements of Mff, 
the matrix AMffA' is not changed. This means that we need restrictions 
on the ordering of the diagonal elements of Mff. In this paper, we assume 
that the diagonal components of Mff are arranged from the largest to the 
smallest and they must be distinct and positive. Because of this restriction, 
we naturally require that the diagonal elements of estimator Mff are also 
arranged in this order, which is important for the proof of consistency. 

Under IC3, for the same reason, we assume that the diagonal elements of 
■^A'S^A are distinct and positive, and are arranged in decreasing order; 
A is identified up to a column sign change. 

IC4 imposes ^r(r + 1) restrictions on the factor loadings, and ^r(r — 1) 
restrictions on the factors. Identification is fully achieved like IC1. 



oo 



Table 1 
Identifying restrictions 



Restrictions on F 



Restrictions on A 



Restrictions on A 



Restrictions on F 



IC1 
IC2 

IC3 

IC4 



IC5 



Unrestricted 

Mf f = diagonal 
(with distinct elements) 

Mff = I r 

Mf f = diagonal 



M 



ff 



A=(i r ,A' 2 y 

j r A'S- 1 A = I r 



jjA'^eJ-A = diagonal 
(with distinct elements) 

A = (Ai,Ai)' 
/ 1 



Ai = 



Ai: 



A21 1 ■ ■ 

A' = (Ai,A 2 )' 
/An ••• 
A21 A22 



1/ 



y A r i A r 2 • • • A rr J 
A«^0, j = l,2,...,r 



IC1' 
IC2' 

IC3' 

IC4' 



IC5' 



Unrestricted 

M\\ — diagonal 
(with distinct elements) 

M xx = I r 



M\\ — diagonal 



A/aa = It 



F = (Ir,Fl)' 
±,F'Y,\- 1 F = I r 



^F''El~ 1 F = diagonal 
(with distinct elements) 

F = (F{,Fl)' 
( 1 ■■■ 0\ 

/21 1 ■■■ 



Fi = 



\ /rl Jr2 

F' = (F{,Fi)' 

(hi ••• 

/21 /22 



V frl fr2 

/«#0,i = 1,2, 



1/ 

r 

/rr / 

. ;■ 



a 

> 

> 
o 
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Under IC5, we can only identify A up to a column sign change. In addition, 
we need nonzero diagonal elements for the lower triangular matrix. The 
reason is intuitive. If the ith diagonal element is zero, both the ith and 
{i + l)th columns will share the same structure. 

IC1 is related to the measurement error problem; it assumes that the 
first r observations are noise measurements of the underlying factors. IC2 
and IC3 are the usual restrictions for MLE; see [18]. IC4 and IC5 assume 
a recursive relation: the first factor affects the first variable only, and the 
first two factors affect the first two variables only, and so on; they are widely 
used, for example, [4] and [16]. Clearly, IC1, IC4 and IC5 require a careful 
choice of the first r observations in practice. The inferential theory assumes 
that the underlying parameters satisfy the restrictions, implying different Aj 
under different restrictions. 

5. Asymptotic properties of the likelihood estimators. Since the number 
of parameters increases as N increases, the usual argument that the objec- 
tive function converges in probability to a fixed nonrandom function and the 
function achieves its maximum value at the true parameter values will not 
work. This is because as N and T increase, there will be an infinite number 
of parameters in the limit. Our idea of consistency is to obtain some average 
consistency, and then use these initial results to obtain consistency for indi- 
vidual parameters. Even the average consistency requires a novel argument 
in the presence of an increasing number of parameters. 

PROPOSITION 5.1. Let be the MLE by maximizing (2.5), where 9 = 
(Ai, . . . , Ajv ; <5"? ; . . . Mff). Under Assumptions A-D, when N,T — > oo, 
with any one of the identification conditions IC1-IC5, we have 

1*1. 

( 5 - la ) a7^F I|Xj 

8=1 * 

(5.1b) if>? 

i=l 

(5.1c) M ff 

Establishing the above result requires a considerable amount of work. De- 
veloping and identifying appropriate strategies have taken an even greater 
amount of efforts. The difficulty lies in the problem of infinite number of pa- 
rameters in the limit and the nonlinearity of objective function. The infinite 
number of parameters problem in this paper is fundamentally different from 
those in the existing literature. For example, consider an AR(oo) process 



-o?) 2 4o, 
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Xt = ^2'jLi a j £ t-j- Although there exist an infinite number of parameters 
{aj}JL 1 , the assumption that aj — > 0, as j — > oo, effectively limits the num- 
ber of parameters. For example, cL = is consistent for aj for j > ln(T). 
The assumption that aj —> may be viewed as one form of smoothing re- 
striction. However, in the present context and in the absence of any form of 
smoothness, all parameters are free parameters, and there will be an infinite 
number of them in the limit. This is the source of difficulty. 

While there is also an infinite number of parameters problem in the anal- 
ysis of the principal components (PC) estimator, the method does not es- 
timate heteroscedasticity, and it minimizes an objective function stated in 
Section 2. Its degree of nonlinearity is much less than the likelihood func- 
tion (2.5). It is the joint estimation of heteroscedasticity that makes the 
analysis difficult. In the Appendix, we provide a novel proof of consistency, 
which constitutes a departure from the usual analysis, say, in [19] and [24]. 

The proofs of (5.1a) and (5.1c) depend heavily on the identification con- 
ditions. If we denote A = (A - A)'E~ 1 A(A / S7 e 1 A)" 1 , the proof of consis- 
tency centers on proving A — > 0. However, the proof of A — > is quite 
different with different identification conditions. Under IC2, for example, 
(A'S^A/AQ -1 = I r . Under other identification conditions, the proof of even 
(k't-yk/N)- 1 = Op(l) is extremely demanding. Under IC2, IC3 and IC5, 
we need to assume that the estimator A has the same column signs as those 
of A in order to have consistency. Having the same column signs is regarded 
as part of the identification restrictions under IC2, IC3 and IC5. 

In order to derive the inferential theory for the estimated parameters, we 
need to strengthen Proposition 5.1. We state the result as a theorem: 

Theorem 5.1. Under the assumptions of Proposition 5.1, we have 
1 N 1 

(5.2a) _^_||A J -A,|| 2 = O p (T- 1 ), 

i=l % 
1 N 

(5.2b) _^2_ (7 2 ) 2 = 0p(T -l )) 

i=l 

(5.2c) \\M ff -M ff \\ 2 = O p (T- 1 ). 

It is interesting to compare our results with those in classical factor anal- 
ysis. If A" is fixed, the existing literature has already shown that Xj and <7 2 
converge to Aj and a 1 - at the rate of y/T for any j. Since A^ is fixed, the 
classical result implies (5.2a) and (5.2b). In fact \\M ff -M ff \\ 2 = O p (T" 1 ) 
holds also since it can be derived from the first two (the results analo- 
gous to (5.2c) under IC1 when A^ is finite can be seen in [1]). Theorem 5.1 
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shows that these results still hold in the large-iV setting despite estimat- 
ing an increasing number of elements. However, we point out that the rate 
stated in Theorem 5.1 is not the sharpest. If IC2 or IC3 is adopted as 
identification conditions, then ||M/y — Mff\\ 2 = O p (T _1 ) can be refined as 
\\Mff -M ff \\ 2 = OpiN^T- 1 ) + O p (T~ 2 ). Because (5.2c) is sufficient for the 
inferential theory to be developed, we only state this general result. 

As pointed out earlier, the behavior of A = (A — A) / S~ e 1 A(A'S~ 1 A) _1 is 
important in establishing consistency. In fact, matrix A plays a key role 
in the inferential theory as well. The convergence rate of A depends on 
identification conditions. We use A^ in place of A under ICk (k= 1, 2, . . . , 5). 
Under IC2 and IC3, the convergence rate of A is min (VNT,T). However, 
under other sets of identification conditions, the convergence rate of A is VT. 
This difference in convergence rate affects the limiting distribution of Mff, 
which also makes the limiting distributions of Xj different. 

In Section C of the supplement [6], we give the asymptotic representa- 
tions of VT(Xj — Xj) under IC1-IC5. The main representations are given 
in (C.5), (C.12), (C.17) and (C.24), respectively. The following theorem is 
a consequence of these representations. 

Theorem 5.2. Under the assumptions of Proposition 5.1, for each j = 
1,2, . . . , N , as N,T — > oo, we have: 
Under IC1, 

(5.3a) VTCXj - Xj) -4AA(0,(M // )- 1 (A;-S eer A J + a 2 )). 

Under IC2 or IC3, 

(5.3b) VT(Xj - Xj)AM{0, (Mff)- 1 ^). 

Under IC4, for j > r 

(5.3c) Vf(Xj - X j )^M(0,U + (M ff )- 1 a 2 ) 

and for 2 < j < r 

Vf(Xj - A,) A AA(0,n + 2P-\M }f )- l a 2 3 - (M//)" 1 ^ 2 ). 
Under IC5, for j > r 

(5.3d) Vf(Xj-Xj)^M(0,E + I r a 2 ) 

and for l<j<r 

Vf(Xj - Xj) A Af(0, E + 2I j r a 2 - I r a 2 ), 

where U = (A^/ r )D(M // )$ J D(M // )'(A j ® I r ), E = {X' j ®I r )DTD'{Xj ®I r ). 

S eer is an r x r diagonal matrix with the jth diagonal element a 2 ; Ir is an 
r x r diagonal matrix with the first j diagonal elements being 1 and the rest 
being 0. The meanings of D(M fj),D,& andT are explained below. 
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The matrix D(M) (with M = M tf) is a generalized duplication matrix 
of r 2 x \r{r + 1) depending on the diagonal matrix M; D(M) can be con- 
structed row by row in the following way. Given the number k, 1 < k < r 2 , we 
denote j = [(k — l)/rj + 1 and i = k — (j — l)r, where [-J denotes the largest 
integer no greater than the argument. If i > j, all elements of the kth row are 
zero, except that the (^(2r — j + 2)(j — j + l)th element is 1; if i < j, all 

elements are zero, except that the (i(2r — i + 2)(i — 1) — i + j + l)th element 
is —uijin^ 1 , where rrij is the jth diagonal element of M. The r 2 x \r(r — 1) 
matrix L> under IC5 is also a generalized duplication matrix. Let A be 
a skew-symmetric matrix and let veck(A) be the operator that stacks the 
elements of A strictly below the diagonal into a vector (excluding diagonal 
elements). Then D is defined as vec(A) = Dveck(A). 

Here are some examples for D(M). If M is a scalar, then D(M) = 1. If 
M = diag(mi, 7712), then 


1 



D(M) 



If M = diag(mi,7Tt2, ITI3), then 






Vo 



-7712/7711 




D(M) 



/I 














Vo 





1 



-7772/7771 











1 





-7773/7771 










1 














1 



-7773/7772 




°\ 









1/ 



Here are some examples for D, which only depends on the dimension 
of M*f (i.e., the number of factors). If r = 1, then D = 0. If r = 2, then 
D = (0, 1,-1,0)'. If r = 3, then 



D 



( 





^ 


1 











1 





-1 























1 





-1 











-1 


V 





) 
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The matrix $ in (5.3c) is the limiting covariance of vec^A^), where A4 is 
the A matrix defined earlier under IC4. The asymptotic representation of A^ 
is given by (C.15) in Section C of the supplement [6]. From this asymptotic 
representation, the elements of $ can be easily computed and are given 
by (C.16). The matrix T in (5.3d) is the limiting covariance of veck(A^), 
where A$ is the matrix A under IC5, and is asymptotically skew-symmetric. 
The asymptotic representation of A§ is given by (C.22). The elements of T 
are determined by (C.23) in Section C of the supplement [6]. 

Remarks: In classical factor analysis, the MLE usually imposes the restric- 
tion of IC3. The limiting distribution Xj in classical factor analysis (fixed N) 
is very complicated; see [18]. Most textbooks on multivariate statistics do not 
even present the limiting distributions owing to its complexity. As pointed 
out by Anderson ([2], page 583), the limiting distribution is "too compli- 
cated to derive or even present here." In contrast, the limiting distribution 
under IC3 with large N is 

VTCXj-Xj) 4aT(o,(m // )-V]). 

This is as efficient as the case in which the ft are observable. For if the ft 
are observable, the estimator of Xj by applying OLS to (2.1) is A° ls = 

(T- 1 Ztiift ~ f)(ft ~ fyrHT- 1 Etiift - f)(*jt - %))■ It ^ easy to show 
that VT(Xf s - Xj) 4aT(0, (M//)" 1 ^ 2 ), the same as the MLE estimator un- 
der IC2 and IC3. The OLS estimator is the best linear unbiased estimator 
under Assumption B, as the error terms of the regression equation are i.i.d. 

Now we state the limiting distributions of M// and a 2 - for each j. In Sec- 
tion C of the supplement [6], we derive the asymptotic representations for 
Mff — Mff under IC1, IC2 and IC4, respectively, which are given by (C.7), 
(C.ll) and (C.19). The asymptotic representation for dj—oj is given by (C.4) 
The next two theorems follow these asymptotic representations. 

Theorem 5.3. Under the assumptions of Proposition 5.1, we have: 
Under IC1, 

VTvech(M ff - Mff) A A/"(0, 4D+(£ eer ®M ff )D+'). 
Under IC2, N/T — > and normality of en, 
v / iVTdiag(M // - Mff) 

AN{^J r [2(I r ®Mff)n{I r ®Mff) + A{Q®Mff)}J' r ). 
Under IC4, 

v T.liagi .V//, - Mff) 4 A^O^KAiE-^Ai)- 1 ®M f f] J' T ), 
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where is the Moore-Penrose inverse of the duplication matrix D r ; J r 
is an r x r 2 matrix, which satisfies, for any r x r matrix M, diag{M} = 
J r vec(M), where diag{-} is the operator which stacks the diagonal elements 
into a vector. 

Note under IC3 and IC5, Mff is known and thus not estimated. Normal- 
ity under IC2 is used only for calculating the limiting variance. Given the 
asymptotic representation of Mff — Mff, it is easy to derive the limiting 
distribution under nonnormality. 

Theorem 5.4. Under the assumptions of Proposition 5.1, with any set 
of the identification conditions, we have 

(5.4) Vf(a] - a)) 4^(0,^(2 + Kj )), 

where Kj is the excess kurtosis of ejt- Under normality of en, the limiting 
distribution becomes M(0,2aj). 

Our analysis assumes that the underlying parameters satisfy the identi- 
fication restrictions, which is also the classical framework of [18]. A conse- 
quence is that we are directly estimating the underlying true parameters 
instead of rotations of them. The rotation matrix used in [23] and [7] degen- 
erates into an identity matrix. This result itself is interesting. 

6. Asymptotic properties for the estimated factors. The factors ft can 
be estimated by two different methods. One is the projection formula and the 
other is the generalized least squares (GLS). These methods are discussed 
in [2]. 

If the factor ft is normally distributed with mean zero and variance S//, 
and is independent of et, then the joint distribution of (ft,zt), by (2.2), can 
be written as 

(6-1) \ ft 

Zt 

Given zt, the best predictor of ft, ff, is ff = S/jA^ASj/A' + See) -1 ^ — a). 
By the basic result (AS// A' + See)" 1 = S" 1 - S~ 1 A(SJ / 1 + A'S^A)" 1 x 
A'S" 1 , we have 

(6.2) ff = (Ejj + A'S^A^A'S- 1 ^ - a). 

Although (6.2) is deduced under the assumption of normality of ft and in 
this paper the ft are fixed constants, equation (6.2) can still be used to 
estimate ft by replacing the parameters with their corresponding estimates. 
So the estimator f t is 

(6.3) ft = (M7, 1 + A'S^A^A'S- 1 ^ - z). 



N 



'AS// AS//A' + S e 
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An alternative procedure is the GLS. If the Xj and <r| are observable, the 
GLS estimator of ft is (A'E~ A) A'£~ (zt — z). The unknown variables 
can be replaced by their estimates. We define the GLS estimator of ft as 

(6.4) f t = (A't~ e 1 kr 1 A't- e \z t -z). 

Under large N, not much difference exists between (6.3) and (6.4). In fact, 
they are asymptotically equivalent and have the same limiting distributions. 
But for relatively small N, the difference may not be ignorable. 

Proposition 6.1. Under the assumptions of Proposition 5.1, ft = ft + 
O p (l/N). 

Since ft and ft have the same limiting distribution, we only state the 
distribution for (6.4). In Section D of the supplement [6] we derive the 
asymptotic representations for y/~N (ft — ft), which are given by (D.3), (D.4) 
and (D.5), respectively. From these representations, we obtain: 

Theorem 6.1. Let A e [0, oo) . Under the assumptions of Proposition 5.1 
and y/~N /T — > 0, we have: 
Under IC1 and N/T A, 

(6.5a) VN(f t - ft) 4aT(0, AfXMff)- 1 ^^ + Q" 1 ). 

Under IC2, 

(6.5b) ^N(f t -f t )AM(0,I r ). 
Under IC3, 

(6.5c) ^(/t-ZtJ-^jV^Q- 1 ). 
Under IC4 and N/T -> A, 

y/N(ft ~ ft) 

(6.5d) 

A AA(0, A(J r ® fl)D(M ff )$D'(M ff )(I r ® / t ) + Q" 1 ). 
f/nder IC5 and A^/T -> A, 

(6.5e) >/jV(/ f - / t ) 4 AT(0, A(J r ® f' t )DYD\l r ® f t ) + Q" 1 ). 

T/ie matrix Q is defined in Assumption C. The matrices S eer , &,T,D(Alff) 
and D are defined in Theorem 5.2. 

If A = 0, Theorem 6.1 shows that ft has the same limiting distribution 
regardless of the identification restrictions. In this case, the variance is equal 
to Q = limjv^oo A~ 1 A'S~ e 1 A. Note that under IC2, Q = I r . Irrespective of 
whether A is zero, ft is efficient under IC2 and IC3 in the sense that the lim- 
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iting variance coincides with the situation in which both the factor loadings 
and the variances o\ are observable and GLS is applied to a cross-sectional 
regression for each fixed t. This result requires y/N /T — > 0. 

Recall that a factor model has two symmetrical representations. We also 
discussed earlier which presentation should be used in practice. If N is 
smaller than T, we should estimate the factor loadings by the maximum 
likelihood method because this representation has fewer number of param- 
eters. The opposite is true if T is smaller than N. This intuitive argument 
is borne out by the results of Theorems 5.2 and 6.1. By Theorem 5.2, the 
magnitudes of N and T do not affect the limiting covariance of Xj (other 
than the rate of convergence) but they do affect the limiting covariance of ft 
as shown by Theorem 6.1. If A is large, ft cannot be estimated well under 
IC1, IC4 and IC5. Note that ft is not the maximum likelihood estimator. In 
this case, we can use the representation of Section 3 and directly estimate ft 
by the maximum likelihood method. 

7. Comparison with the principal components method. The method of 
principal components (PC) does not assume a factor model, and the method 
is usually regarded as a dimension reduction technique. But PC can be used 
to estimate factor models under large N and large T; see [7, 11, 13] and [22]. 
Let /V° c and ff c denote the PC estimators for Xj and ft, respectively. The 
results of [5] and [8] imply the following asymptotic representation for the 
principal components estimators. As N, T — > oo, if y/N /T — > 0, then 

1_2\ 



■Mifd^Mf^a 



j 



and if VN/T^O, then 

^v(/f - ft) = ( I E a,a: ) (-±=£ A ^ ) + °p0) 




4aa(o,(m aa )- 1 t(m aa )~ 1 ), 

where M AA = lim^oo £\ =1 A;A'- and T = lim^-*.^ j? Y^i=i ^iK a h 

The PC estimator in [5] uses the identification restriction IC3. Under IC3, 
we already show that the MLE satisfies, as N,T — >■ oo, 

VT(A i -A i )4jV(0 > (M // )-V?). 

Theorem 6.1 above shows that, under IC3, y/N(f t - f t ) 4 jV(0, Q" 1 ). This 
result requires \fN /T — > 0, which is satisfied if N/T — > A. 
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While \j and Xj have the same limiting distribution, ff c is less effi- 
cient than ft. This follows because the sandwich form of the covariance 
matrix (M \\)~ 1 T(M aa) 1 is no smaller than Q~ l , where Q is the limit of 
jf^Z i= l x AjA^. Moreover, under IC3, the MLE Xj only requires N,T — » 

oo, but the PC estimator Xj requires an additional assumption that y/T/N — > 
0. Furthermore, the maximum likelihood estimator Aj is consistent under 
fixed N , but Aj° requires both N and T to be large in order to have consis- 
tency. Of course, under fixed N, the limiting distribution of MLE will have 
a different (more complicated) asymptotic covariance matrix; see [18]. 
To estimate <r|, the PC method would need to estimate the individual 

residuals eu = Xu — 6\ — (A? )'/^ and then construct <r| = ^ Ylt=i &jf ^ n 
case that N is fixed, ft cannot be consistently estimated, so eu is inconsistent 
for Bit- This further implies that crj is inconsistent for <r|. In comparison, 
the MLE does not estimate the individuals in- The variances are estimated 
jointly with the factor loadings Xj and with the matrix Mff. The variance 
estimator remains consistent under fixed N. 

Finally, the PC estimator for Aj satisfies (see [5]) YliLi ll^f C ~~ ^«ll 2 = 
Opijf) + O p (±), while the MLE for \ t satisfies ^ Y%=i \\ X i ~ -MP = °p(t)- 

8. Computational issues. The maximum likelihood estimation can be 
implemented via the EM algorithm and is considered by [21]. The EM al- 
gorithm is an iterated approach. To be specific, consider the identification 
condition IC3. Once the estimator under IC3 is obtained, estimators un- 
der other identification restrictions can be easily obtained (to be discussed 
below). Under IC3, we only need to estimate A and S ee since Mff = I r . 

Let 6>( fc ) = (AW,E^) denote the estimator at the feth iteration. The EM 
algorithm updates the estimator according to 

-i 



t=\ 



Ij^EiztfllZ^) ±Y.E{f t f t \Z^) 



t=i 



T 

T 

S ( fe e +1 ) = diag(M„ - A^+^AW^EW)- 1 ^ 
where E z k J = A^aW + , and 



i^S(/^|Z,e«)=AC*) , (E«)- 1 JM„(EW)- 1 AC*) 

i=l 

+ / r -A«'(sW)" 1 AW, 

^E{ztfi\Z,9^) = M zz {^)- 1 A^. 
t=i 
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This gives 6»( fc+1 ) = (A( fc+1 \ The iteration continues until ||6>( fc+1 ) - 

0(ty || is smaller than a preset tolerance. In the simulation reported below, we 
use the principal components estimator as the starting value. Let (At, ST) 
denote the final round of iteration. Let V be the orthogonal matrix con- 
sisting of the eigenvectors of -^A T '(sL) _1 A T corresponding to descending 

eigenvalues. Let A = A T V and S ee = S ee . Then 8 = (A, S ee ) satisfies IC3. For 
general models, [25] shows that the EM solutions are stationary points of 
the likelihood functions. For completeness, we provide a direct and simple 
proof of this claim for factor models in the supplement [6] (Section E). 

It is interesting to note that, under large N and large T, the number 
of iterations needed to achieve convergence is smaller than under either 
a small N or a small T. In Section E of the supplement [6], we also explain 
how to write a computer program so it runs fast. 

Let (A, S ee ) denote the MLE under IC3. We discuss how to obtain esti- 
mators that satisfy other identification restrictions. First, note that S ee is 
identical under IC1-IC5. We only need to discuss how to obtain A and Mff. 
Let A* and ML denote the MLE under Id (£ = 1,...,5). Let Ai denote 
the first r x r block of A. For IC1, let A 1 = A(Ai)" 1 and ML = AjA^. 
This new estimator satisfies IC1. For IC2, let A 2 = A^A'S" 1 !) -1 / 2 and 
Mj f = iA'S^A. Then this estimator satisfies IC2. For IC4, M ff = I r is 
known. Let A' x = Q1Z be the QR decomposition of A^ with Q an orthogo- 
nal matrix and 1Z an upper triangular matrix. Define A 4 = AQ. Then A 
satisfies IC4. Finally, consider IC5. Let W be the diagonal matrix with its 
diagonal elements the same as the first r x r block of A 4 . Let A 5 = A 4 W -1 
and Mj f = WW. Then IC5 is satisfied. 

We now consider the finite sample properties of the MLE. Data are gen- 
erated according to zu = Aj/t + eu with r = 2, where Xi, ft are i.i.d. A/"(0, 12) 
and en follows J\f(0,af) with <r 2 = 0.1 + 10 x Ui, and U{ are i.i.d. uniform 
on [0,1]. Adding 0.1 to the variance avoids near-zero values. We consider 
combinations of T = 30, 50, 100 and N = 10, 30, 50, 100, 150. Estimators un- 
der different identification conditions only differ up to a rotation matrix, 
so only IC3 will be considered. We also compute the principal components 
(PC) estimator for comparison. To measure the accuracy between A and A 
(both are ^ x 2), we compute the second (the smallest nonzero) canonical 
correlation between them. Canonical correlation is widely used as a measure 
of goodness-of-fit in factor analysis; see, for example, [14] and [17]. Similarly, 
we also compute the second canonical correlation between F and F. For the 
estimated variances, we calculate the squared correlation between diag(S ee ) 
and diag(Sge). The corresponding values for the principal components esti- 
mators are also computed. Table 2 reports the average canonical correlations 
based on 5000 repetitions for each (N,T) combination. 
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Table 2 
The performance of MLE and PC 



1\T 

ly 


1 




MLE 






PC 




A 


/ ■ 
r 




A 


t 




10 


30 


0.4818 


0.3473 


0.8432 


0.4058 


0.2744 


0.7991 


30 


30 


0.7276 


0.7995 


0.9273 


0.6391 


0.6450 


0.9223 


50 


30 


0.7676 


0.8973 


0.9303 


0.7221 


0.7953 


0.9302 


100 


30 


0.7874 


0.9555 


0.9308 


0.7679 


0.9006 


0.9312 


150 


30 


0.7941 


0.9719 


0.9310 


0.7823 


0.9347 


0.9315 


10 


50 


0.6080 


0.4153 


0.8951 


0.4875 


0.2975 


0.8187 


30 


50 


0.8383 


0.8407 


0.9583 


0.7751 


0.7113 


0.9499 


50 


50 


0.8589 


0.9161 


0.9590 


0.8306 


0.8341 


0.9569 


100 


50 


0.8722 


0.9624 


0.9592 


0.8613 


0.9198 


0.9591 


150 


50 


0.8764 


0.9764 


0.9592 


0.8697 


0.9475 


0.9593 


10 


100 


0.7563 


0.4939 


0.9448 


0.5878 


0.3298 


0.8345 


30 


100 


0.9182 


0.8614 


0.9793 


0.8789 


0.7519 


0.9700 


50 


100 


0.9292 


0.9245 


0.9798 


0.9135 


0.8572 


0.9770 


100 


100 


0.9362 


0.9668 


0.9798 


0.9305 


0.9308 


0.9792 


150 


100 


0.9383 


0.9788 


0.9799 


0.9349 


0.9545 


0.9798 



The results suggest that the precision of A is closely tied to the size of T 
and the precision of F is tied to N. This is consistent with the theory. For 
all (N,T) combinations, the MLE dominates PC. The domination becomes 
less important for N > 50 and T > 50 for estimating factor loadings. But for 
small N, no matter how large is T, MLE noticeably outperforms PC. For 
the estimated factors, there is still noticeable outperformance even under 
large iV and T. These are all consistent with the theory. 

9. Conclusion. In this paper we have developed an inferential theory for 
factor models of high dimension. We study the maximum likelihood estima- 
tor under five different sets of identification restrictions. Consistency, rate 
of convergence and the limiting distributions are derived. Unlike the princi- 
pal component methods, the estimators are shown to be efficient under the 
model assumptions. While both the factor loadings and factors are treated 
as parameters (nonrandom), the key to efficiency is not to simultaneously 
estimate both the factor loadings and the factors. If N is relatively small 
compared with T, the efficient approach is to estimate the individual fac- 
tor loadings (Aj) and the sample moment of the factor scores (ft), not the 
individual scores. The sample moment contains only r(r + l)/2 unknown 
elements. If the factor scores ft are of interest, they can be estimated by the 
generalized least squares in a separate stage. The estimated factor scores 
are also shown to be efficient under the model assumptions. The opposite 
procedure should be adopted if N is much larger than T. In the latter case, 
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we estimate the individual factor scores and the sample moment of the fac- 
tor loadings. If iV and T are comparable, the choice of procedures boils 
down which heteroscedasticity, cross-sectional dimension or the time dimen- 
sion, is the object of interest. The paper also provides a novel approach to 
consistency in the presence of a large and increasing number of parameters. 

APPENDIX: PROOF OF PROPOSITION 5.1 

The following notation will be used throughout: 
H = {k't-Jk)-\ 
H N = N-H = (N^A't^A)- 1 , 

G={Mj}+kt-yA)-\ 

G N = N-G, 

ft = (eu,e 2 t, ■ ■ -^rt)'- 

From (A + B)~ l = A" 1 - A~^B(A + B)~\ we have H = G(I - MjjG)~ l . 
From Y, zz = AM// A' + E ee , we have 

(A.l) E- 1 = S-^ - E- e 1 A(M- 1 + A'E^A^A'E- 1 . 

It follows 

(A.2) 



A'E" 1 = A'Ee" 1 - k!t-Jk{Mj} + A'Ee"e 1 A)" 1 A'S e - 1 



MjjGA't-^. 



To prove Proposition 5.1, we use a superscript "*" to denote the true 
parameters, for example, A*, E* e , ft, etc. The variables without the super- 
script "*" denote the function arguments (input variables) in the likelihood 
function. 

Let 6 = (Ai, . . . , A n , . . . , ct^, M//) and let O be a parameter set such 
that C~ 2 < < C 2 , M// is positive definite matrices with elements bounded. 
We assume 6* = (A*, . . . , A* , af*, . . . , <7 2 * , Mjf) is an interior point of O. For 
simplicity, we also write 9 = (A, E ee ,M//) and 9* = (A*, E* e , M?.). 

Proof of Proposition 5.1. The centered likelihood function can be 
written as 

L(9) = L(9) + R(9), 

where 

L(9) = -1 ln|E„| - 1 tr(E zz (0* )E" 1 ) + 1 + 1 In|E((9*)| 

andi?(0) = -itr((M^-E 2Z (r))E- 1 ). Note that l + ^^i ln l s (^*)l docs 
not depend on any unknown parameters and is for the purpose of centering. 
Lemma A.2 [6] implies that sup e \R(9)\ = o p (l). In particular, we have 

\R{9)\ = o p (l) and \R(0*)\ = o p (l). So \R{9*) - R{9)\ = o p (l). Since 9 maxi- 
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mizes L{6), it follows L{9) + R0)) > L(6*) + R(9*). Hence we have L{9) > 
L(9*) + R(9*) - R{9) > L{9*) - \o p (l)\. However, the function L{9) achieves 
its maximum at 6*, so L{9) < L(8*). Since L(9*) is normalized to zero, we 
have 1(0) > -|o p (l)| and L(9) < 0. It follows that L(9) = o p (l). 

Notice |S M | = |E ee | • \I r + MffMT^K\. But \I r + Mf/A'S^A] = O(AT). 
Similarly |£ w (0*)| = |E*J • | J r + M^A*'E*~ 1 A* | , thus uniformly on G, 

~7f ln|Szz| + ^ = HEeel + ^ In|S* e l + O (^) ■ 

Next, from E zz (0*) = A*M* ff A*' + E* e , we have E„(0*)Ej£ = 
A*M* f A*' x E- 1 + E^E- 1 . Using the formula for E" 1 , we have tr (E^E" 1 ) = 
tr^S" 1 ) + 0(1), because tr [£* e £ ~ 1 A(Mjj + A'S^A^A'S- 1 ] = O(l). 
The latter follows since the matrix in the square bracket is bounded in norm 
by C^A'S^A x {Mjj + A'E^A) -1 )! < C 4 \\I r \\ due to the bound on erf 

and a* 2 . Thus divided by N, we have 

itrp^OE" 1 ] = Itr^M^E" 1 ] + Itr^E" 1 ) + o(J-\ . 

Notice ln|E ee | = 5Zj=i mo f an d ^(EggEjg 1 ) = X^!=i a T l a 1 '■> we nave P rove d 
that 

% = " i E ( ln ^ + $ ~ 1 " ln < 2 ) " i tr(A*M^A*'E~ 1 ) + O 

uniformly on G. By L{9) = o p (l), it follows that 
1 N / rr* 2 \ 1 

- ]v E ln ^ + - 1 - ln af - iv tr ( A * M // A * ,£ - ) A o. 

i=l V a j / 

A key observation is that both terms are nonpositive; it follows 

(A.3) ^W m ^ + ^_i_wf)4o, 

i=l V °i / 

(A.4) ItrCA'M^E-ijAo. 

Consider the function /(a;) = lnx H — | m<T r 2 ~~ 1- Given that < C~ 2 < 

cj? < C 2 < oo for C > 1, for any x S [C~ 2 , C 2 ], there exists a constant 6 (e.g., 
take b= ^), such that f(x) > b(x - a* 2 ) 2 . It follows 

°p<v = h ln " f + S - 1 - wf ) - 4 ^> 2 - fff )2 - 



22 



J. BAI AND K. LI 



The above implies 

(A.5) ^E^-< 2 ) 240 - 

i=i 

Now we turn to (A. 4). By (A.l), we have 
1 

N 



:tr(A*M) / A*'S- 1 ; 



= 1 tr(M^A*'[S~ 1 - t-yk{Mjj + A'S~ 1 A)~ 1 A / S~ 1 ]A*). 

From (M^ 1 + A'S^A)- 1 = (A'E^A)- 1 - {K!t^k)- x Mj}{Mj} + 
A 1 x E^A) -1 , we obtain 

— ti 

N V ff 



tr(A*M; f A*'E- ] 
1 



- - trlM^A^E-U* - m; / A*'E- 1 A(A'E- 1 A)- 1 A / E- 1 A*] 

+ trfM^A^E^ACA'E-U)- 1 ^/^-/ + A / Ej e 1 A) _1 A / Ej e 1 A*] 
Both expressions are nonnegative; by (A. 4), we must have 

(A.6) 1 tr(M; / A*'s- e 1 A* - m; / a* , s- 1 A(A , e- 1 A)- 1 A , ij- 1 a*) 4 



and 



— tr(Afjf / A* / E~ 1 A(A'Eg e 1 A) -1 



(A.7) 

x Mj}(Mjj + A'S^A^A'S^A*) A 0. 



By (A.5) and Lemma A.4 [6], ^ t^M^A^E-^A*) 4C*>0, say. From (A.6) 

ltr(M / * / A*'E- 1 A(A / E- 1 A)- 1 A / E- 1 A*) 4 C* > 

with the same C*. The preceding result and (A.7) imply 

Mj}{Mj} + A'E- 1 !)- 1 = op(l). 

By assumption, we confine Afjj on a compact set, that is, M/^ = O p (l). By 
the definition of G, we have G = o p {\). From H = G{I- MjjG)- 1 , we have 
-ff = o p (l). We obtain the following result: 

(A.8) G = o p (l); H = o p (l). 
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The matrix on the left-hand side of (A. 6) is semi-positive definite and is finite 
dimensional (r x r), its trace is o p (l) if and only if every entry is o p (l). Thus 
we have 

1 

N' 



^(M* ff A*'±-^h* - m; / a*'s- 1 a(a'e- 1 a)- 1 a's- 1 a*) 4 0. 



Pre-multiplying both sides by M*^ 1 gives 

(A.9) ^'E^A* - ^'t-^k't-^Mt^A* 4 0. 
The second term on the left-hand side can be rewritten as 

[ABACA'S- 1 !)- 1 ] (^A'E- 1 !) [(A / E~ 1 A)~ 1 A / I3^ 1 A*]. 

Let A=(A- A*)'£~ e l AH, where H = (A'E^A)" 1 . It follows that A^S" 1 x 
A(A'S f T e 1 A)~ 1 = (I r - A). So (A.9) is equivalent to 

Ia^a* - (I r - A)±A'±£k{I r - A)' 4 0. 

However, ^A^E^A* = iA*'E*~ 1 A* + o p (l) by Lemma A.4 [6] and (A.5), 
thus 

(A.10) ^A*'E: e _1 A* - (I r - A^Mt-^Ir - A)' 4 0. 

Because the first term is of full rank in the limit, the second term is also of 
full rank. This implies that I r — A in the limit is of full rank. 
Meanwhile, equation (A.9) can be expressed alternatively as 

^(A-A^E^A-A*) 

- A (A - A*)'E~ 1 A C^A't^Aj ^A'E-^A - A*) 4 0, 
which can also be written as, in terms of A, 

(A.ll) ^(A-a^'e-^A-a*)-^ (lA'E-^A^'Ao. 

Both (A. 10) and (A.ll) will be useful in establishing consistency. 

We now make use of the first-order conditions. The first-order condi- 
tion (2.7), by (A. 2), can be simplified as A'E~ 1 (M za - t zz ) = 0. This gives 

/ T T 

A'E^ 1 A*MffA*' + A* - ]T ft 4 + ? E e ^' A *' 

V t=i *=i 

+ ^ - Ke) + Ke ~ AM// A' - E ee 



0. 



t=l 
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For simplicity, we neglect the smaller-order term A'E^ee'. The jth column 
of the above equation can be written as (after some algebra), 

- Mjfiikt-^{k - A*)M* ff \* 
(A.12) + MjjHkt^K* f i ft*^ 

+ Mj}Hh!t£ (± £ e t /f A* - MjfHX — ia] - of) 

V t=i ) a j 
( N 1 1 T \ 

\i=l a i t=l / 

Consider the first-order condition (2.9). By the method analogous to the one 
in deducing (A.12), we have 

M f f - M* ff = -FA'S" 1 (A - A*)M*ff - M* ff (A - A*)'t^AH 

+ HA't^(A - A*)Mj f (A - A*)'t~yAH 



(A.13) 



+ HA't-jA* f I ft<\ t^KH 

( N N 1 1 T \ 

Vi=l i=l * •? t=l / 

* 1 

-i^^-A^^-af)^. 



i=i °i 



Substituting (A.13) into (A.12), we obtain 
Aj - A* = MjjM* ff (A - A*)% e 1 AH\* 



- M"/ .HAZE" 1 (A - A*)M* ff (A - A*) E~ AJTAy 

- Mj}HA't-jA* ( 1 £ tfejj E^AiTA* 



M ff 
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/ AT N T \ 

i H E E E^* - £( e ^*)] h\) 

\i=l j=l i 3 t=l / 

N 1 
i=l a i 

(A.14) + MjjHk't^h* J2 ft^ 

( N 1 1 T \ 

+ M 7fH E — E^* - E ( e ^jt)\ 

\i=l a i t=l J 

-MjjHX^-af). 
Consider (A. 13). The fifth term of the right-hand side can be written as 
Hk't-J f A J2 etfA ~ H^t-J f A J2 etfA A, 

where A= (A — A*)'T I ~^A.H is defined following (A. 9). The first term is 
11^1/2^1/211 . o p (r-i/2) by Lemma A.3(b) [6] and the second term is 

A- WN^H^W -O p {T~ l l 2 ). 

The fourth term is the transpose of the fifth. The sixth term is given in 
Lemma A. 3(d). The seventh term is bounded by \\H\\ ■ |Ei=i^?^i^i(^f ~~ 

a 2 )H\\. The term \\Y,i=i^fh^ 2 ~ is bounded by 2C 4 ^ due to 

i 

| J? (of — cr* 2 )| < 2C 4 because of the boundedness of a 2 , a* 2 . So the seventh 

term is o p (l) by (A. 8). Given these results, in terms of A, equation (A. 13) 
can be rewritten as 

M ff - Mff = -A'M*f f - M* f fA + A' M^ A + 1 1 iV 1 / 2 ^ 1 / 2 1 1 • OpiT' 1 / 2 ) 
(A.15) -A-WN^H^W-OpiT- 1 / 2 ) 

+ \\ N ^H 1 / 2 \\ 2 -0 P (T- 1 / 2 ) + o p (l). 
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However, by the definition of H, NH = (-^A'S" 1 !) -1 . Equation (A.10) 
yields (iA'S^A)- 1 = (I r - A)> {± h*>^ k*)-\l r - A) + o p (||I r - Af). 
So we have 



tr 



(J r - ^'(Ia^'E:- 1 ^ (/,. - A) + o p (||/ r - A|| 2 ) 



The right-hand side is at most O p (A 2 ), implying that W^^H^-^W = O p (A). 

Given the above result, we argue that the matrix A must be stochastically 
bounded. First, notice that the left-hand side of (A. 15) is stochastically 
bounded by Assumption D. So if A is not stochastically bounded, the right- 
hand side is dominated by A'M* f} A in view of HiV 1 / 2 ^ 1 / 2 !! = O p (A), But 
A'MJjA will be unbounded since Mjj is positive definite. A contradiction 

is obtained. Thus A = O p (l); it follows that WN 1 ' 2 H l ' 2 \\ = O p (A) = P {1). 
Given this result, we have 

(A.16) Mff - M* ff = -A'Mf f - MjfA + A'M* ff A + o p (l). 

Next consider (A. 14). The seventh to ninth terms of the right-hand 
side of (A.14) are all o p (l) by Lemma A. 3 [6] and 1 1 iV 1 / 2 ^ 1 / 2 1 1 =0^(1). 
The last term is bounded by 2C 3 ||M^ / 1 || • || J-Aj^ 1/2 || • ||if 1/2 ||. Since 

EjLill^^^ 1/2 || 2 = r > ll^z/ll = °p( 1 ) b y Assumption D and H = o p {\) 
by (A. 8), it follows that the last term is o p (l). The third and fourth terms 
are similar to the fourth and fifth terms in (A. 13), and hence are o p (l) due 
to A = O p (l), || A* || < C for all j and Mff being bounded. Using the same 
arguments for (A.16), we have 

(A.17) Xj - A* = MjjM*f f A\* - MJfA'MffAXj + o p (l). 

We next prove consistency by using the identification conditions. 
Under IC1: Since the identification condition is A* = [I r , A|']' and A = 
[ir, Ay, the first r x r upper block of A* is the same as that of A, that 

is, [\* 1 ,\* 2 ,...,\* r ] = [X 1 ,X 2 ,...,K]=Iv By (A.17) with Xj-X* = 0, j = 
1,2,. ..,r, 

Mff A - A' Mff A 4 0. 

We now attach a subscript to matrix A to signify which identification con- 
dition is used. For ICk (k = 1,2,..., 5), we use A^ to denote the corre- 
sponding A. So the above equation implies MffA\ — A\MJjA\ 4 0. Taking 
transpose, we have A[Mi f - A' 1 M f * f A 1 4 0. Thus Mj f A 1 - A[Mi f 4 0. 
Post-multiplying A\, we obtain MJ^A\ — A^MJ^Ai 4 0. But we also have 
MffAi - A[MffAx 4 0. Thus M* ff A\ - M* ff A x 4 0. Since M* ff is posi- 
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tive definite, we have A\(I r — A±) 4 0. Since we have proved that I r — A\ 
converges in probability to a nonsingular matrix, it follows that A\ — > 0. 

From (A.ll) and A x 4 0, we obtain ^(A - A*) / Ej e 1 (A - A*) 4 0, which 
is equivalent to (5.1a). From (A. 16) and A\ — > 0, we obtain Mff — Mff — > 0, 
which is (5.1c). This proves Proposition 5.1 under IC1. 

Under IC2: From the identification condition -^A'E^A = ■^A* / S*~ 1 A* = 
I r , by adding and subtracting terms, we have the identity 

1(A - A'/E^A + ^A'S-HA - A*) 

(A.18) 

= -^'(S- 1 - K~ l )A* + -(A - A^'S-^A - A*). 
By (A.5) and Lemma A.4 [6], the term -^A^E" 1 - E^A* is o p (l). Thus 

1(A - A*)'^ 1 ! + ±A% e l (A - A*) - i(A - A'/E^A - A*) 4 0. 
The above can be written in terms of matrix A (i.e., A 2 under IC2), 
A 2 + A> 2 - 1(A - A*)'E^ e 1 (A - A*) 4 0. 

With ^A'E-^A = J r , (A.ll) implies A 2 A' 2 - A(A - A*)'Eg e 1 (A - A*) 4 0. 
These two results imply A 2 + A' 2 — A 2 A' 2 4 0, which is equivalent to 

(A 2 -I r )(A 2 -I r )' -I r ^0. 

However, (A. 16) is equivalent to 

M ff - (A 2 - I r )'M} f {A 2 - I r ) 4 0. 

Under IC2, M^ is a diagonal matrix with distinct elements. Also, Mff is 
a diagonal matrix by restriction. Applying Lemma A.l [6] with Q = A 2 — I r , 
V = Mfj , and D = Mf f , we conclude that Q and thus A 2 — I r converge in 
probability to a diagonal matrix with elements being either —1 or 1. Equiv- 
alently, A 2 converges to a diagonal matrix with diagonal elements being 
either or 2. By assuming that A and A* have the same column signs, we 
rule out 2 as the diagonal element So A 2 = o p (l). The rest of the proof is 
identical to IC1, implying Proposition 5.1 under IC2. 

Under IC3: IC3 requires Mff = M* f = I r , and so by (A. 16), (A 3 -I r )(A 3 - 

J r )'-/ r 40. From (A.10), 

_L A *' E * e -i A * _ {As _ Ir y (± A's-i a) (A 3 - I r ) 4 0. 

Under IC3, -^A*'E*~ 1 A* is diagonal with distinct elements, and ^A'E^A 
is also diagonal by estimation restriction. The latter matrix has distinct di- 
agonal elements with probability 1. It follows that A3 — I r converges in 
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probability to a diagonal matrix with diagonal elements either 1 or —1 
by Lemma A.l [6] applied with Q = (A 3 - I r )' , V = ^A'E" 1 !, and D = 
-^A*'S*~ 1 A*. The remaining proof is identical to that of IC2 and hence 
omitted. So we have proved Proposition 5.1 under IC3. 

Under IC4: By the identification condition, both A^ and Ai are lower 
triangular matrices, where Ai is first r x r submatrix of A. Consider the 
first r equations of (A. 17), 

A; - Af - Mjj(M* ff A A - A' 4 Mf f A 4 )Al' A 0. 

By (A.16), we have M ff - M* ff + A' A M* } f + M* ff A A - A' i M* ff A A A 0, which 
can be rewritten as 

Mff - (I r - A A )'M* f f(I r - A A ) A 0. 
The above two equations imply 

(A. 19) (I r - A A )'M* ff (I r - A 4 )(A / 1 - A? ) — (I r — A A )'M} f A A kt A 0. 

Since both M ff and M ff are of full rank, M ff - (I r - A A )'M* ff (I r -A 4 )4o 
implies I r — A A is of full rank. Pre-multiplying [(J r — A^'Mjj] -1 , we obtain 

(A.20) (I r - A A )A[ - Af A 0. 

Since both Ai and A^ are lower triangular with diagonal elements all 1, both 
matrices are invertible. It follows that I r — A A — A*'(A^) _1 A 0. Since both Ai 
and A^ are lower triangular, we have I r — A A converges to an upper triangular 
matrix. However, Mff and M*^ are both diagonal matrices and invertible. 

For Mf f - (I r - A a )'Mjj (I r - A A ) A to hold, given that I r - A A is an upper 
triangular matrix, it implies that I r — A A converges to a diagonal matrix. 
Because both Ai and A* are matrices with diagonal elements 1, and given 

the asymptotic diagonality of A A , it follows by (A.20) that I r — A A A I r . So 
we have A A A 0. The remaining proof is the same as in IC1 and is omitted. 
This completes the proof for IC4. 

Under IC5: Both Mff and M*^ are identity matrices; it follows from (A.16) 

that (I r — A^)'(I r — ^5) — I r A 0. The derivation of (A.20) only involves the 
full rank of I r — A, so it is applicable for IC5, that is, (I r — ^5)^ — A' x A 0. 
Since both Ai and Ai are lower triangular and invertible, it follows that 
I r — A§ converges to an upper triangular matrix. Given this result and 
(I r — A^)'(I r — ^5) — I r A 0, it follows that A§ converges to a diagonal 
matrix with diagonal elements either or 2. By assuming that the column 
signs of A and A* are the same, we have A§ A 0. The remaining proof is the 
same as in IC1 and is omitted. So we have proved Proposition 5.1 under IC5. 
This completes the proof of Proposition 5.1. □ 
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The proofs for other results are provided in the supplement [6]. 
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SUPPLEMENTARY MATERIAL 

Supplement to "Statistical analysis of factor models of high dimension" 

(DOI: 10.1214/11-AOS966SUPP; .pdf). In this supplement we provide the 
detailed proofs for Theorems 5.1-5.4 and 6.1. We also give a simple and 
direct proof that the EM solutions satisfy the first order conditions. Remarks 
are given on how to make use of matrix properties to write a faster computer 
program. 
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